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Abstract 

We consider estimation of the covariance matrix of a multivariate random vector 
under the constraint that certain covariances are zero. We first present an algo- 
rithm, which we call Iterative Conditional Fitting, for computing the maximum 
likelihood estimator of the constrained covariance matrix, under the assumption of 
multivariate normality. In contrast to previous approaches, this algorithm has guar- 
anteed convergence properties. Dropping the assumption of multivariate normality, 
we show how to estimate the covariance matrix in an empirical likelihood approach. 
These approaches are then compared via simulation and on an example of gene ex- 
pression. 

Some key words: Covariance graphs; Empirical likelihood; Graphical models; Marginal indepen- 
dence; Maximum likelihood estimation; Multivariate normal distribution 



1 Introduction 

In this paper we consider estimation of the covariance matrix of a random vector, subject 
to certa in entries being set t o ze ro. Such restricti ons appear, for example, in recent 



to certa in entries being set t o ze ro, bucn restricti ons appear, tor example, m recent 
work bv lGrzebvk et al and lMao et al.l (j2nn4h . Suppose we have a random vector 
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Figure 1: The covariance graph for the matrix in 



X = (Xi, X2, X3, X4)' € whose covariance matrix S exhibits the zero pattern 
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It is often helpful to visualize the patt ern of zeros by a so-called cov ariance graph, 



especially for larger covariance matrices ( Cox and WermuthL 199,'j . 



19961 ) ■ A covariance 

graph has one vertex for each one of the random variables in the random vector. In 
the above example, the vertex set is y = {1,2,3,4}, where the random variable Xi is 
identified with its index i. Next, each pair of vertices (i,i) € x 1/, i ^ j, is connected 
by an edge unless aij = 0. Assuming that the covariance matrix in (|1.1() has no zeros 
other than those indicated explicitly, its covariance graph is given in Figure [H Here we 
use bi-directed edges in keeping with the p ath diagram notation used by ^W right (1923); 
other authors have used dashed edges; see ICox and WermuthI (jl 99.i . 1199^. 

We define a covariance graph model as the set of joint distributions in which the as- 
sociated zero restrictions hold in the covariance matrix. In the absence of an assumption 
of normality, the model does not have a Markov interpretation. 

The Gaussian covariance graph model is the family of all multivariate normal dis- 
tributions AA(/u, S) such that Uij = whenever i 3 and i j. Clearly, o"jj = if 
and only if Xi and Xj are marginally independent; in symbols Xi^LXj. Hence a Gaus- 
sian covariance graph model is a graphical model based on marginal independence in 
contrast with graphical models based on undirected graphs (Markov random fields), di- 
rected acyclic graphs (DAGs, Bayesian networks), or chain graphs, where the absence of 
an edge between two vertices generally indicates some conditional inde pende nce between 
the associated variables (Edwards, 2000; Lauritzen, 1996; Whittakeil. Il99nl ). 

Maximum likelihood (ML) estimation in Gaussian covariance graph models is not 
well developed: the conceptual simplicity of these models belies the fact that, in contrast 
to undirected graph models, they form curved exp onential families. For instance, the 
graphical modelling software MIM ( Edward^. 2000l . S7.4) permits fittin g of such models 
only by a heuristic "dual likelihood" method due to KauermannI (|l996l ^. There is, how- 
ever, an algorithm due to lAndersonl(ll969l . ll97nl .l Il973f ) that can be used to compute the 
ML estimate in models defined by linear hypotheses on covariance matrices, hence also 
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in covariance graph models. However, it is unclear when this algorithm converges and 
when its limit points are positive semi-definite matrices. Such issues become more press- 
ing when mis-specified models are fitted, as will be the case in a specification search. In 
this paper, we introduce a new algorithm for ML estimation in covariance graph models, 
called Iterative Conditional Fitting (ICF) , which does not suffer from the same problems 
as Anderson's algorithm. 

For situations in which multivariate normality does not hold, estimates may still be 
obtained via procedures based on normality such as ICF and dual estimation but the 
behaviour of these methods is then unclear. As an alternative, we present an approach, 
based on empirical likelihood ( OwenL 200lh . which provides consistent estimates even 
without normality. We compare the different estimation methods on real and simulated 
data. 



2 Covariance graph models 
2.1 Non-parametric model 

Suppose that we observe a random vector Yy = {Yi \ i G V)' G M^, indexed by V, 
and with joint distribution Py. Let S(Py) = {(Jij) € M^^^ be the unknown covariance 
matrix. Let G = {V, E) be a graph with the variable set V as vertex set and the edge set 
E (^V \ {{i^i) I i E V} consisting exclusively of bi-directed edges (x, j), (j, i) G E, 
denoted by i ^ j- Let 'PiV) be the cone of positive definite V x V matrices and let 
P(G) be the cone of all matrices S G P(V^) which fulfill the linear restrictions 

i^j =^ aij=Q. (2.1) 

The covariance graph model M(G) associated with the bi-directed graph G is simply 
the family of joint distributions 

M(G) = {Py |S(Py)GP(G)}. (2.2) 

We consider estimation of the unknown parameter S = S(Pv') based on a sample of 
observations Y^^^'' G M^, A; G iV = {1, . . . , n}, that are i.i.d. according to Py G M(G). 
The set N can be interpreted as indexing the subjects on which we observe the variables 
in V . We group the vectors in the sample as columns in the V N random matrix Y 
so that 

Var(y) = S /tv- (2.3) 

Here, 1^ is the N x N identity matrix and is the Kronecker product. Thus the i-th 
row Yi G of the matrix Y contains the i.i.d. observations for variable i ^ V on all 

(k) 

the subjects in N and the fe-th column Yy ' holds all the observations made on subject 
k N. Finally, the sample size is n = lA'^j and the number of variables is p = \V\. 
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2.2 Gaussian model 



We define a Gaussian covariance graph model as the multivariate normal submodel 



N(G) = (A/V(/", S) 1 S G P(G)) C M(G). 



(2.4) 



The log-likelihood function i of the covariance graph model N(G) is a function from 
X P(G) to M and can be expressed as 



£(/.,S) = -^log(27r)-^log|S| 



n 



(2.5) 



see e.g. lEdward^ ()2nnnl . §3.1). Here 5 



IS 



5 = -(y - ljv)(y - /X Itv)' G 



fVxV 



n 



(2.6) 



where Ijv = (1, • • • , 1)' G ^ • For any given value of S, ()2.5() is maximized by setting 
ji = Y ^ R^, i.e., the vector of the row means of Y . Hence, the profile log-likelihood for 
S, ^(S) is obtained by replacing S with 



S = -{¥ -Y ®In){Y - Y e 
n 



(2.7) 



in (|2.5() . Working with the profile likelihood corresponds to fitting the submodel of N(G) 
in which /i = and adjusting the sample size to n — 1. 

If S is positive definite, which will occur with probability 1 if n > p+\ ( Eaton and Perlmanl . 

I973I ). then the global maximum of ^(S) over P(G), i.e., the ML estimator of E, exists. 
In general, the condition n > p + 1 is not necessary for almost sure existence of the ML 
estimator but we are not aware of anv results in t he literature which provide a neces- 
sary and sufficient condition (compare Buhl 1993h . In the sequel we will assume S to 
be positive definite. Note that since the model N(G) is a curved, but not necessarily 
regular, expon ential family, the likelihood function mav, and in fact can, have multiple 
local maxima (|Drtonl . l2nn.4 IPrton and EichardsonL I2n04h . 
Let 



F = {{i,i) \ i^V}\J{{i,j) (iV^\i<j M^j} 



(2. 



be the pairs of vertices indexing unrestricted elements in the matrix S G P(G). The 
cardinality of F is equal to the number of vertices plus the number of edges in the graph 
G. The unrestricted elements of S form the vector 



(2.9) 



In order to write derivatives of the log-likelihood function in compact form we introduce 
the matrix Q with entries in {0, 1} that satisfies vec(S) = Qa, where vec is the operator of 
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column-wise matrix vectorization. The columns of Q that are associated with a variance 
an contain exactly one entry equal to one, whereas a column of Q that is associated 
with a covariance 0"^, i 7^ j, i <-> j, contains exactly two entries equal to one. If the 
graph G is complet e, i.e., all possib le edges are present in G, then Q is the duplication 



grapn Cr is complet e, i.e., ail possiDie cages 
matrix described in Harvillj (|l997 . p. 352). 



The first derivative of the log-likelihood function, that is, the score function can then 
be written as 



n 



Q' [vec(S-i5S-i) - vec(S-i)] , 



(2.10) 



see 



da 2 

Harvillj (Il997l . § 15) for details on the necessary matrix differential calculus. It follows 



(2.11) 



that the likelihood equations di{T,)/da = are 

(S-i),, = (S-i^S-i),,, {i,j)eF; 

compare also lAnderson and Olkinl (jl985l . §2.1.1). The full matrix S is determined by 
ij = for (z, j) ^ F, that is for i j and i <A j. 

The second derivative of ^(S) can be computed using results from Harvillel ( 1997 . 
jl5.9), and we find that the Hessian matrix equals 

a2£(s) 



(T. 



da'^ 



^-l^ 



-li 



p-i® (S-i5S-i)]}Q. (2.12) 



Its negated expectation under 7VV(0, S), the Fisher-information, equals 



-E 



a^£(s) 



n 



(2.13) 



and can be used for normal approximation to the distribution of roots of the likelihood 
equations. Sections S 05I focus on the computations of such roots. 



3 Existing estimation methods for Gaussian covariance graphs 



We are aware of only one specialized algorit hm for ML estim ation applicable to covari- 
ance graph models. This algorithm is due to lAndersonI (|l973l ') and will fit any Gaussian 
mode l obtained from a linear hypothesis on the covariance matrix ( Andersonl . 19691 . 

In this section, we describe the incarnation of Anderson's algorit hm that fits 
covari ance graph models. We also review a dual estimation method due to KauermannI 
(|l996h . which produces estimates that are unique and asympto tically effic i ent th ough, 
in general, not solutions to the likelihood equations. Note that ICox et al. I have 
recently proposed moment based estimators in the special case where the graph is a 
chain, or equivalently, the covariance matrix is tri-diagonal under a suitable ordering. 



5 



3.1 Anderson's algorithm for ML estimation 

Each iteration of Anderson's algorithm solves a system of linear equations built from the 
current estimate of S. In the case of covariance graphs, the linear equations are solved 
for the vector a of unrestricted elements in S, compare (|2.9|) . and can be specified as 
follows. Let o"*-' = and A = be the F x F matrix with entries 

_ / a^'^a^'' if k = e, 



(ijM) I ^ik^je ^ ^jk^ii if ^ _^ 

Here and {k,i) are elements of F. Furthermore, let 6 = 6s be the F x 1 vector 
with components 

hj = {i:-'S^-%, {i,j)€F (3.2) 

From lAndersonl(ll973l 'l. it follows that S G P(G) solves A-^a = b^, if and only if S solves 



the likelihood equations (|2.11j) . 

This motivates the following iterative scheme. Start with some G P(G). Iter- 
atively update the current estimate Yp"^ to S^'""*"^-' determined by the linear equations 

^^„^('-+i)=6^„. (3.3) 



A fixed point of this algorithm solves the likelihood equations (|2.11|) . As starting value, 
Anderson suggests the identity matrix, i.e., S^'^^ = Iv- In the first step, his algorithm 
constructs the empirical estimate E^-^^ with (T^|^ = Sij, G F. However, neither S*^^) 
nor any subsequent estimate of E has to be positive (semi-) definite and thus may not 
be a valid covariance matrix. Moreover, at any given stage, the likelihood function may 
decrease, and convergence of Anderson's algorithm cannot be guaranteed. 

3.2 Kauermann's dual estimation 

Dual estimation is based on the maximization of a dual likelihood function, which is 
motivated by interchan ging the role of the parameter matrix S and the empirical covari- 
ance matrix S in (|2.5|) ( Kauermannl . 199fil . §4). Procedurally, dual estimation, also called 
minimizing the discriminant information, amounts to finding the matrix E^uai S P(G') 
that solves the equations 

i^dLih = iS-%, y{i,j)&F, (3.4) 

while satisfying that (Eduai)ii = for all F. Contrary to ()2.11|) . the equation 
system ()3.4() always has a unique so l ution that can be found by the iterative proportional 
fitting algorithm; see also lEdward^ toodl . §7.4). In particular, if the covariance graph is 



decomposable, then iterative proportional fitting will terminate in finitely many steps, 
and the dual estimator Sdual is available in closed form. 
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4 Iterative conditional fitting for Gaussian covariance graphs 



In this section, we present the new Iterative Conditional Fitting (ICF) algorithm for ML 
estimation, which is guaranteed to produce positive definite roots of the Hkehhood equa- 
tions of covariance graph models. We begin by explaining the idea of iteratively fitting 
conditional distributions that stands behind ICF, and then show how the algorithm can 
be implemented using simple least squares computations. 

4.1 The idea of iterative conditional fitting 

Starting with some initial estimate of the joint distribution, the idea of ICF is to repeat- 
edly iterate through all vertices i eV, and 

(i) Fix the marginal distribution for the variables different from i, i.e., the variables 
-i = V\{i}; 

(ii) Estimate, by maximum likelihood, the conditional distribution of variable i given 
the variables —i under the constraints implied by the covariance graph model 
N(G); 

(iii) Find a new estimate of the joint distribution by multiplying together the fixed 
marginal and the estimated conditional distribution. 

Since we fix the marginal distribution of variables —i in the update for variable i, all 
marginal independences amongst the variables —i still hold true after the update. There- 
fore, only the marginal independences involving variable i lead to constraints for the 
estimation in step (ii). 

In order to make the idea more precise, let T,a,b denote the A x B submatrix of S 
and Ya denote the ^ x submatrix of Y, where A,B CV. Clearly, 

Y_i - M-ixN{0, ^-i-i ® In)- 

Hence, step (i) simply fixes the value of i.e., everything but the i-th row and col- 

umn of S. As S-i.-i remains unchanged in the i-th update many of the zero constraints 
imposed on the covariance matrix trivially hold true also after the update. 
The conditional distribution of Yi given YLj is the normal distribution 

{Yi I Y_i) ~ M{,y^^{B,Y_i, XaN), (4.1) 

where 

Bi = Ei,_,(S_i,_0-' e R»^-^ (4.2) 
is the {i} X —i matrix of regression coefficients, and 

Xi = an - _j(S_j _j)"^S_j,j G (0, oo) (4.3) 
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is the conditional variance. If the graph G was the complete graph G in which an edge 
joins any pair of vertices then the mapping 

P(G) = P(y) ^ (0, oo) X M«x-^ X P_i(G), 

would be bijective and the regression in 1)4. 1|) a standard least squares regression. Here, 
Pa(G) is the set of all j4 x ^4 submatrices of matrices in P(G), A C.V . For a general 
graph G, ()4.4|) is no longer bijective and (|4.1() is not a standard regression because we 
need to respect the restriction E G P(G), i.e., the restrictions aij = if j G —i, j i. 
However, this can be circumvented using synthetic pseudo-variables that are computed 
from the data YLj and the fixed matrix 



4.2 Pseudo-variable regressions 

Instead of working with the regressions coefficients Bi, we exploit the fact that Bi equals 
Ej^_j multiplied by the inverse of the fixed submatrix Let sp{i) = {j | i <-> j} be 

the set of spouses oi i £ V and let nsp(i) = \ (sp(i) U {i}) be the set of non-spouses, 
yielding the partition V = {i} U sp(i) U nsp(z). Then the conditional expectation of 
(Yi I Y-i) can be written as 

jesp{j) 

(i) 

where the pseudo-variable Zj is equal to the j-th row in 

Z« ) = [(S_,_,)-i],p(,),_,y_, G M^pW>^^. (4.6) 
In (|4.5|) . we exploit that aij = if j G nsp(i). From (|4.5|) . we obtain 

(y, |y_,)~A/'Rxiv( ,A,/^). (4.7) 

iesp(i) 

Let F-i{G) be the set of —i x — z submatrices of the matrices in P(G). Then the mapping 



P(G) ^ (0,oo) X rWx'^pW X P-i{G) 

(4.8) 

Tj I— I- (Aj, Sj^sp(j), S_j^_j) 

is a bijection, which implies that the parameters aij, j G sp{i), and Aj are variation 
independent in (|4.7() . Therefore, if is fixed to equal some given matrix in P_j(G), 
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then ()4.7|) constitutes a standard normal regression model whose parameters cjjj, j € 
sp(i), and Aj can be estimated by the usual least squares formula. The estimate of Aj 
yields an estimate of an by solving (|4.3() for an. Thus, we obtain the ML estimator of 
the i-th. row and column of S when is fixed. In particular, after updating the i-th 

row and column we are still left with a matrix S € P(G). 

4.3 The iterative conditional fitting algorithm 

Let S^*") be the estimate of S after the r-th iteration and S the estimate of S after 
the i-th update step of the r-th iteration in ICF, i.e., after estimating (Yi \ Y^i). 

Algorithm 1. The ICF algorithm can be implemented as: 

1. (Initialization) Set the iteration counter r = 0, and choose a starting value S^'^^ € 
P(G), e.g. the identity matrix f^^^^ = ly- 

2. (Updates) Order the variables as y = {1, . . . set S^''''^) = S^*"), and repeat the 
following steps for all i = 1, . . . ,p : 

(i) Let = and calculate from this submatrix the pseudo-variables 

^sp{i) o.(^'^ording to \4-(^ - 
(a) Compute the ML estimators 

±'^''\., = Yi{z^'\.y\z^'\.jz'-'\.y]-\ 

i,sp{t) sp(i)^ L sp(j)^ sp(i)^ J ' 

1 (4.9) 

A- = —(Y- - s'''''*'' z''*-* )(Y- - i;'''''*^ z^^^ y 

for the linear regression j^- The existence of the matrix inverse follows 
from the assumed non- singularity of the sample covariance matrix S. 

(Hi) Complete by setting 

a';'^'^ = Xi + t^'''\.Mt^f ,)-^] (4.10) 

a * i,sp(i) Jsp(j),sp(j) sp(j),i ' ' 

compare 

3. (Repeat) Set = Increment the counter r to r + 1. Go to 2. 

The iterations can be stopped according to a criterion such as "the estimate of E is 
not changed" (in some pre-determined accuracy). 
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Example 2. Figure[21illustrates ICF for the model N(G) based on the graph G shown in 
Figure^ The algorithm cycles in arbitrary order through the four regressions {Yi \ Yli), 
i = 1,2,3,4. In Figure 12 a filled circle represents variables in the conditioning set 
— i, and an unfilled circle stands for the variable i forming the response variable in the 
considered regression. The directed edges coincide with bi-directed edges in the original 
graph in Figure ^ and indicate the pseudo- variable regressions to be carried out. The 
vertices that are joined to vertex z by a directed edges are labelled with the pseudo- 
variables that act as covariates. The directed edges are labelled with the covariances 
that are estimated. 




Figure 2: Illustration of the pseudo- variable regressions in ICF. 



Remark 3 (Complexity). The algorithm can be restated only in terms of the empirical 
covariance matrix S defined in 1)2. 7|) . For example in 1)4. 9() . 

^sp(i)(^sp(i)) = [{^-i-i) ]sp(j),-i5'-i,-i[(S-i,-i) ]-i,sp(i)- 

Other products between data matrices appearing in the sequel can be similarly expressed 
in terms of the empirical covariance matrix S. Thus, the sample size does not affect the 
complexity of the algorithm. The complexity of one of the algorithm's pseudo-variable 
regression steps is dominated by the computation of the inverse of Ti-i-i in 1)4. 6() . and 
the inversion of a matrix of size sp(z) x sp(i) (j4.9|) . Note that may be sparse 

and special methods for inversion of sparse matrices might be useful. In particular, if 
the induced subgraph G-i has disconnected components then only the submatrices of S 
over connected components containing spouses of i have to be inverted. 

4.4 Convergence 

The key to prove convergence of ICF is to recognize that the algorithm consists of 
iterated partial maximizations over sections of the parameter space P(G'). In ICF we 
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repeatedly maximize the likelihood function of the covariance graph model partially by 
allowing only the entries in the i-th row and column of S to vary. The remaining entries 
are fixed. A bit more formally, we consider the parameter space 

e = {S G P(G) [ > (4.12) 

which is compact, though not necessarily connected, and contains the global maximizer 
of i{Ti). Recall that we assume the empirical covariance matrix S to be positive definite. 
Defining the section 0i(Xl) C as 

ei(S) = {S G G I E_,,_i = (4.13) 

it becomes clear that the algorithm steps 2(i)-2(iii) maximize the log-likelihood function 
partially over the section 0j(S(^'*~^)), i.e. 

S^*^'*) = argmax{£(S) | S G Qiit'-'^^'-^'^)} . (4.14) 

This local and global maximizer over the section is unique. If a matrix S G P(G) 
maximizes the log-likelihood function over all sections 0i(S), i V, simultaneously, 
th en it solves the likel i hood equations. Hence, the following theorem follows from results 



m 



Drton and Eichlen ()2005l . Appendix). 



Theorem 4. Suppose the sequence (S^^^) is constructed by the ICF algorithm. Then all 
accumulation points o/(S^'"^) are saddle points or local maxima of the log-likelihood func- 
tion. Moreover, all accumulation points have the same likelihood value. In particular, if 
the likelihood equations have only finitely many solutions, then (S*^'"-') converges. 



5 Iterative conditional fitting with multivariate updates 

The algorithm presented in ^is based on updating one row and column of an estimate 
of the covariance matrix S G P(G) by carrying out a univariate regression. A natural 
modification of this approach is to update several rows and columns of the estimate 
E G P(G) simultaneously using multivariate regression. 

5.1 Seemingly unrelated pseudo- variable regressions 

Let C C y be a subset of the vertices. In order to estimate all rows and columns of S 
that are indexed by the vertices in C in the ICF algorithm presented in ^ we have to 
carry out several univariate pseudo- variable regressions for (Y^ | y_j), i £ C. Instead, 
we would like to consider only one multivariate regression of the form (Yq \ Y-c)i where 
— C = V\C. The conditional distribution 

(Yc I y_c) ~ AfcxNiBcY^cJ^c <S) In) (5.1) 
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is specified by the matrix of regression coefficients 



Be = Sc,-c(S-c,-c)"' G R^""-^, (5.2) 

and tlie conditional covariance matrix 

Ac = - Sc,-c(S-c,-c)"'S„c;,c G P(C). (5.3) 

In order for the conditional distribution 1)5. Ij) to be of a simple structure, there should 
be no constraints on the Ac, in which case P(C) = Pc(G). This holds if there are no 
constraints on the submatrix Sc,C) which in turn holds if the set C is complete, i.e., if i 
j whenever i,j G C and i ^ j. Then the only constraints on the conditional distribution 
1)5. Ijl are on the matrix of regression coefficients Be and stem from restrictions that 
aij = 0, if i G C, J C and j <A i. 
Let 

sp(C) = [U(sp(i) I i G C)] \C (5.4) 

be the spouses of C, that is the vertices that are not in C but adjacent to some vertex 
in C, and let nsp(C) = \ (sp(C) U C) be the non-spouses of C, yielding the partition 
y = CUsp(C)U nsp(C). If we define the pseudo- variables 

= [{^-c,-c)-Xic),-c Y-c G M«P(^)x^, (5.5) 
then we can rewrite ()5.H) as 

{Yc I y-c) ~ Mc^n{T.cmc) ^sSc) ' ® In): (5.6) 

because ^c,-nsp{C') =0- As S ranges through P(G), the submatrix ^c^gp^c) playing the 
role of regression coefficients in 1)5. 6() ranges through the linear space 

PcMC)iG) = {Ae R^x«P(^) I A,, = if i j}. (5.7) 

Hence, 1)5. 6() constitutes seemingly unrelated regressions (IZellneil . ll963) . 



5.2 The iterative conditional fitting algorithm with multivariate up- 
dates 

ML estimation in seemingly unrelated regressions i tself gen e rally requires iterative algo- 
rithms, such as iterating the two-step estimator of IZellner In the case of ()5.6)) . 



the two-step estimator consists of first estimating Sc,sp(c) for some fixed Ac by gener- 
alized least squares, and then estimating Ac as the empirical covariance matrix of the 

(c) 

residuals Yi — Sc,sp(c)'^sp(c) computed with the estimate of 5^c,sp(c) obtained in the first 
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step. However, if the current estimate of S is used to obtain starting values T,Q^sp{c) 
and Ac, then the two-step method does not have to be iterated in order to obtain esti- 
mates for the seemingly unrelated pseudo-regressions 1)5. 6|) that yield a convergent ICF 
algorithm with multivariate updates. For specification of the estimator of ^c,sp{c) 
need to introduce the matrix Pq of the linear map that sends the vector of unrestricted 
elements in '^c,sp(C) fo matrix '^c,sp(C) ^ 'Pc,sp(C){C!). The vector of unrestricted 
elements of ^c,sp{c) is the vector ac = {cTij | i G C, j € sp(C), i ^ j). The matrix 
Pc has exactly one entry equal to one in each column, the other entries are zero, and it 
satisfies Yec{T,c sp{c)) — Pc^C for S € P(G); compare the definition of the matrix Q in 

m 

In order to run ICF with multivariate updates, we have to choose a family of complete 
sets {C \ C ^ C) such that 

U(C7 I C G C) = y, (5.8) 

where the sets C do not have to be disjoint. For example the sets C could be chosen 
as edges, but the largest possible choice for the sets C would be the cliques, i.e., the 
maximal complete sets, in G. 

Algorithm 5. For a given choice of C, the ICF algorithm with multivariate updates can 
he implemented as: 

1. (Initialization) Set the iteration counter r = 0, and choose a starting value S^'^-' G 
P(G), e.g. the identity matrix S*-*^-* = /y. 

2. (Updates) Order the sets in the family C as C = {Ci, . . . , Cq}, set T,^^'^^ = $](^), 
and repeat the following steps for all G C; 

covariance matrix Ac,, according to L5. and the pseudo-variables Z^'^j^^^^ 



(i) Let = From this submatrix, compute the conditional 

covariance matrix Ac\ according to 15. a 
according to J 5. ,5)) . Calculate Clc^. = {Ac^.)~^ 

Compute the (c 
PCk^Ck, where 



(a) Compute the (generalized least squares) matrix that satisfies vec(5][^^^]p^^^^) 



-1 

X 



{^c.vecfc>c(^S(k))1}. (5.9) 
( Hi ) Compute the empirical covariance matrix of residuals 

^fe n Ck,sp{Ck) sp(Cfe)^v C\,sp{Ck) sp(Cfe)^ ; 
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(in) Complete by setting 



Ac. + S 



{r,k) 



compare \5.^J\) . 

3. (Repeat) Set = S^'"''?). Increment the counter r to r + 1. Go to 2. 

Note that if the family C consists of only singletons then Algorithm [S] reduces to 
Algorithm ^ 

Example 6. We take up the covariance graph shown in Figure ^ For the family C 
of complete vertex sets, several choices are possible. If the cliques C = {13,34,24} are 
chosen, then all conditional distributions considered in ICF are bivariate, whereas for 
C = {1, 2, 34} two univariate distributions are estimated in conjunction with a bivariate 
distribution. For the clique choice C = {13,34,24}, we illustrate the seemingly unre- 
lated pseudo- variable regressions to be estimated in Figure 01 which is to be interpreted 
similarly as Figure [21 An addi tional feature are the bi - direct ed edges that connect the 
vertices in the sets C £ C; see iRichardson and Spirte for a formal definition of 

these graphs. 




Figure 3: Illustration of the seemingly unrelated pseudo- variable regressions in ICF with 
multivariate updates, and C = {13,34,24}. 



5.3 Convergence 

The ICF algorithm with multivariate updates is still an iterative partial maximization 
algorithm. However, the sections in the parameter space over which maximizations are 
performed are not quite as simple as the sections described in H4.4I Steps 2(ii) and 2(iii) 
of Algorithm do not jointly maximize the log-likelihood function i over sections of the 
form 

Gc(S) = {S G e I S_c,-c = S_c,-c}. (5.12) 
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Instead step 2(ii) maximizes i over sections of tlie form 

ei,c(S) = {S G e [ S_c,-c = S^c,-c, Ac = Ac}, (5.13) 

where Ac is again the conditional covariance matrix from (|5.3() . The subsequent step 
2(iii) maximizes i over sections of the form 

e2,c(S) = {S G e I S_c,-c = S_c,-c, ^c,-c = Sc.-c}- (5.14) 

Nevertheless it holds under condition ()5.8|) that if S maximizes the log-likelihood func- 
tion i over both section Qi^ci^) and 02,c(5^) simultaneously for all C G C, then S 
is a solution to the likelihood equations. Thus, Theorem 21 holds also for ICF with 
multivariate updates as stated in Algorithmic 



6 Empirical likelihood estimation 

In contexts where it is not appropriate to assume multivariate normality, we may still 
wish to estimate a covariance matrix subject to various zer o restrictions. Here we present 
an approach based on empirical likelihood ( Owenl . 200 il l. In the resulting method an 



estimate of the underlying distribution is obtained by maximizing a non-parametric 
likelihood under constraints that include the desired zero covariance restrictions; see 
Chaudhuri. Handcock. and Rendal3 ( 2005h and Heller stein and Imbenj ( 19991 ) for similar 



applications of empirical likelihood. 

We associate a weight with the fe-th sample observation Yy , k a N . Estimating 
the mean vector and covariance matrix simultaneously, we solve the nested constrained 
maximization problem 



.=(Si"",.„) n (6-1) 

k&N ) 



subject to 



u;fc>0,fcGiV, (6.2) 
^Wk = l, (6.3) 

Wk (yf ^ - ^i) = 0, V i G y, (6.4) 

^ Wk (yf ^ - fli) (yf ^ - fij) =0, yi,jev s.t. i j. (6.5) 

keN 

Without the additional constraints (|6.4() and ()6.5() . the empirical likelihood ratio IlfceA'' "''"^'^ 
is maximized for Wk = l/ra, k N. The additional constraint (|6.4|) enforces that the 
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mean of the reweighted rows of Y is equal to /x. Constraint ()6.5|) ensures that the 
estimated weights Wk are such that the empirical covariance matrix of the reweighted 
sample satisfies the zero constraints specified by the graph G. 

In order to avoid obvious problems with feasibility of the optimization problem, we 
assume that the sample size, i.e., the number of weights, is strictly larger than the 
number of constraints in (|6.3() and 1)6.5(1 . Note that the number of constraints (|6.5|) may 
grow quadratically as the number of variables increases. The nesting of the maximization 
steps in ()6.1|) is done to avoid cubic constraints in w, which would have resulted had we 
substituted 

= ^«;fcy/'^ yieV (6.6) 

in (|6.5|) and made the constraints in ()6.4|) redundant. The constrained maximization 
problem can be solved through its dual problem, in which the nu mber o f unkn owns 
is equal to the numbe r of constraints of the original problem; see OwenI ( 2001 ) and 



Chaudhuri et al.l (|2005|) for details. 

If fi and w are, respectively, the vectors of mean and weights maximizing ()6.1|) under 
the constraints (|6.2|1 - (|6.5() . then the estimated covariance matrix is given by 

tE = (Y - fK^lN) ■ diag(w) ■ {Y - fi(S) In)' , (6.7) 

where diag( w) is an n x n diagonal matrix with w along its diagonal. Following lOwenI 
( 2001 ) and lOin and Lawless! (|l994l ) one can show that asymptotically fi and S e are 



consistent. 



7 Data and simulations 

We now compare the three approaches to estimation of a covariance matrix with zeros in 
a data example and in simulations: (i) ML estimation relying on ICF, (ii) dual likelihood 
estimation as described in ^3.21 and (iii) empirical likelihood estimation. 



7.1 Gene expression in yeast 

Gasch et al. ( 2000h present gene expression data from microarray experiments with yeast 



strands. We focus on p = 8 genes related to galactose utilization. The gene GALll is 
responsable for transcription. The genes GAL4 and GAL80 are involved in galactose 
regulation. Gene GAL2 is related to transport and the remaining four genes, GALl, 
GAL3, GAL7, and GALIO, are involved in galactose metabolism. There are n = 134 ex- 
periments with gene expression measurements for all eight genes. The observed marginal 
correlations and standard deviations are shown in Tabled where we denote the variables 
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Table 1: Observed marginal correlations and standard deviations. 





Xn 


Xi 


-^80 


X2 


Xi 


^3 


X7 


















Xi 


0.24 














-'^80 


0.08 


0.23 












X2 


-0.18 


-0.03 


0.26 










Xi 


-0.10 


-0.10 


0.28 


0.87 








X3 


-0.18 


0.12 


0.20 


0.44 


0.39 






X7 


-0.07 


-0.08 


0.21 


0.81 


0.88 


0.50 






-0.08 


-0.07 


0.26 


0.87 


0.92 


0.46 


0.91 


SD 


0.39 


0.36 


0.47 


1.70 


1.70 


0.78 


1.85 



for the gene expression measurements by Xi, i = 1,2, 3, 4, 7, 10, 11, 80, using the obvious 
correspondence. 

By multiple testing of correlations as described in iDrton and Perlma 3 (|2004l . \2QQ^ 
and implemented in the R package 'SIN', we selected the two covariance graphs Gg C Gd 
that are illustrated in Figured The larger graph Gd contains all edges shown, i.e., both 
the solid and the dashed edges, whereas the sub-graph Gg includes only the solid edges. 
In Gs the vertices 1, 2, 3, 7, and 10 form a clique and in Gd the clique is enlarged to 
include vertex 80. With the R package 'ggm' and additional code, we computed the 
ML, the dual, and the empirical likelihood estimates of the covariance matrix under the 
zero constraints specified in the graphs. The results for both Gg and Gd are shown in 
Table |2l We remark that we started ICF from the identity matrix and that Anderson's 
algorithm gave the same results as ICF. However, although we refer to "ML estimates", 
ICF is only guaranteed to find a stationary point which may not be the global maximizer 
of the likelihood. 

Upon inspection of Table |21 we find that the three estimates are in better agreement 
for the graph Gd. This graph yields the better fitting covariance graph model. The 
deviance of the model N(Gd) under comparison to the model based on the complete 
graph equals 9.98 over 9 degrees of freedom, whereas the deviance of the model N(Gs) 
equals 33.07 over 13 degrees of freedom. This indicates a good fit of N(Gd) and a poor 
fit of the more restrictive model N(Gs). The difference in log- likelihood between ML 
and dual estimates equals 4.29 in N(Gs) and reduces to 0.51 in N(Gd). In contrast the 
difference in log-likelihood between ML and empirical likelihood estimates equals 20.54 
in N(Gs) and 5.67 in N(Gd). 
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Table 2: Marginal correlations and standard deviations from ML (M), dual (D), and em- 
pirical likelihood (E) estimates for graph Gg (lower half) and graph Gd (upper italicized 
half). 
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Figure 4: Covariance graph for data in Tabled 



7.2 Simulations 



Since the ML estimator Y^m and Kauermann's dual estimator are based on a normal- 
ity assumption, but the empirical likelihood based estimator is not, it is of interest 
to compare their performance, both when the underlying distribution is, and is not, 
Gaussian. We simulated 1000 data sets for sample sizes n = 10, 20, 25, 30, 50, 100 from 
a multivariate normal distribution, and a multivariate t distribution with 5 degrees of 
freedom {t^). The mean vector was zero and the covariance matrix for the multivariate 
normal distribution was 


1 



/I 


1 

2 

Vo 




1 

4 



1 



4 
3 
4 

1/ 



(7.1) 



corresponding to the graph shown in Figure ^ For the distribution, we used S 
as dispersion matrix, which results in the covariance matrix |E. In Figure [3 and IHl 
we present the bias and root-mean-squared error (RMSE) respectively for the three 
estimators (off-diagonal entries are considered once ). For the heavier-tailed mu ltivariate 
distribution, moments up to fourth order exist (jKotz and Nadaraiahl . 2004h . thus it 
makes sense to consider RMSE of the estimated variances and covariances. For sample 
size 10 we experienced problems with the empirical likelihood procedure, resulting from 
an inability to find feasible starting values. Consequently we do not present results for 
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(a) Gaussian 



(b) t. 



Figure 5: Comparing the bias of the ML, DL and the EL estimator for various sample 
sizes. ML ( — M — ), DL (- -D- -) and EL (• • • E- • • )• 



TiE when n = 10. 

Prom Figures infa) and (b) it is evident that the biases of S^; and T,d are larger than 
the bias of T,m for all values of n. Whereas in the Gaussian case, T,e behaves better in 
terms of bias than T,d for n > 20, the opposite happens under the is distribution. As 
would be expected the RMSE of T,m is slightly lower than that of S^) and 'Se under 
Gaussianity; cf. Figure 6(a) On the other hand T,e performs better in terms of RMSE 
than T.M for all sample sizes when the underlying distribution is t^; see Figure |6(b)} 
The RMSE of S^; is also smaller than that of S/j under t^, for moderately large sample 
sizes (n > 20). 



8 Discussion 



We have considered three methods for estimating a covariance matrix with pre-specified 
zeros. In a Gaussian covarian ce graph model both ML estimation and the dual likelihood 
method of KauermannI ( 199(il 'l provide efficient estimates of the covariance matrix. If the 
assumption of multivariate normality is not reasonable, then non-parametric estimates 
can be obtained in an empirical likelihood approach. 

For the problem of maximizing the likelihood function of a Gaussian covariance graph 
model we have introduced the new Iterative Conditional Fitting (ICF) algorithm, which 
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(a) Gaussian 



(b) t5 



Figure 6: Comparing the root mean squared errors of the ML, DL and the EL estimator 
for various sample sizes. ML ( — M — ) DL ( D ) and EL(- ■ ■ E- • • ). 



can be implemented in both a univariate as well as a multivariate version. The advantage 
of multivariate ICF is the maximization of the likelihood function over larger sections 
of the parameter space; the disadvantage is the overhead in carrying out generalized 
least squares computations as opposed to the standard least squares computations of 
univariate ICF. Future practical experience will show whether general recommendations 
in this trade off can be given, but the structure of the particular covariance graph 
considered will certainly be important. 

Besides its clear convergence properties, strengths of ICF include the fact that the 
covariance matrix estimates are positive definite at any stage of the algorithm and that 
only tools from least squares regression are required for implementation. In addition, it 
is very appealing that ICF extends the duality betwee n covariance graph and undirected 
(concentration) graph models (cf. iKauermannL 199fil ) to the level of fitting algorithms. 
The commonly used method for fitting undirec ted graph models, the iterative propor- 
tional fitting (IFF) algorithm ( Whittakei . 199d . pp. 182-185), fits marginal distributions 
while fixing conditionals. ICF does exactly the converse. The abstract idea behind ICF 
can be expressed in terms of marginal and conditional distributions which suggests that 
it is not limited in any way to Gaussian covariance graph models. In fact, work by the 
authors on applying ICF in binary graphical models for marginal independence appears 
promising. 
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The ICF algorithm resembles the Iterative Conditional Modes (ICM) algorithm of 
Besad (|l98(ih . However, ICM obtains maximum a posteriori estimates in a Bayesian 



framework, whereas our ICF maximizes a likelihood function, which constitutes a very 
differently structured problem. Another r elated algorith m is the Conditional Iterative 
Proportional Fitting (CIPF) algorithm of ICrameil |l998l . EoOQ ). CIPF can be used to 
maximize the likelihood function of a model that comprises joint distributions with pre- 
scribed conditional distributions. However, CIPF differs from ICF because the update 
steps of ICF do not simply equate a conditional distribution with a prescribed condi- 
tional, but rather maximize a conditional likelihood function that will generally not be 
the same in two different iterations of ICF. 

It is obviously a most attractive feature of the empirical likelihood procedure that it 
does not require multivariate normaliy. Algorithmically, empirical likelihood estimation 
is more involved than maximum likelihood and dual estimation. In particular, we had 
difficulties obtaining empirical likelihood estimates for smaller sample sizes, which is 
related to a fundamental difference between empirical likelihood estimation and the 
other two methods based on multivariate normality. Both ML and dual estimation 
are possible whenever the sample covariance matrix is positive definite, which occurs 
with probability one if the sample size is larger than the number of variables, and may 
occur for smaller sample sizes if the covariance graph is disconnected. In contrast, 
the optimization problem to be solved for empirical likelihood estimation may become 
infeasible if the sample size is small compared to the number of constraints imposed. The 
number of constraints depends on the covariance graph, and seemingly simpler sparser 
structures impose more constraints and render the empirical likelihood approach more 
sample size-demanding. 

Not surprisingly, our simulations show that the ML estimates computed with ICF 
are preferable if the underlying distribution is indeed multivariate normal. When simu- 
lating from a multivariate t distribution instead non-parametric estimation via empirical 
likelihood gave the best results in terms of mean squared error. 
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